Statistics for Political Science
August 4, 2025
This class is designed to give you the skills to pursue your own independent research projects in the future; don’t worry about writing the perfect paper.
\[ \underbrace{\text{Description}}_{\text{This week}} \;+\; \underbrace{\text{Inference}}_{\text{Next week}} \;=\; \underbrace{\text{Regression}}_{\text{Where the magic happens}} \]
“Correlation doesn’t equal causation”
…but all our statistical evidence about causation is built on correlations.
We need to understand the basics of statistical correlations:
We will introduce quite a lot of notation in the next few weeks. We have tried to keep notation to a minimum, but it is often necessary to communicate complex ideas quickly and precisely.
Notation today:
Quantitative political science seeks to understand political phenomena through numerical representation.
“Different states are debating when, if at all, abortion should be legal during a woman’s pregnancy. A normal pregnancy could go up to as many as 40 weeks. Until what point in a pregnancy do you think a woman should be legally allowed to obtain an abortion?”
Respondents choose number of weeks (0–40)
Fundamental tension in quantitative analysis: detail vs parsimony.
Compare averages across categories:
Summarize relationships across multiple continuous variables:
Central tendency refers to measures that identify the center of a dataset.
Continuous and ordinal variables:
Categorical variables: report as tables or recode as binary (0/1) variables.
¹ This value is often meaningless for continuous variables, so is rarely included.
\[ \bar{Y} = \frac{\sum_{i=1}^n Y_i}{n} \]
Components:
1. Zero-Sum Property
\[ \sum_{i=1}^n (Y_i - \bar{Y}) = 0 \]
2. Least-Squares Property
\[
\sum_{i=1}^n (Y_i - \bar{Y})^2 \;<\;
\sum_{i=1}^n (Y_i - c)^2
\quad \forall\; c \neq \bar{Y}
\]
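Both properties can be checked numerically. The sketch below is only an illustration: the `weeks` values are made up (they are not the survey data), and the helper names `mean` and `sum_sq_dev` are hypothetical.

```python
# Made-up responses on the 0-40 weeks scale, for illustration only.
weeks = [0, 6, 12, 12, 20, 24, 40]


def mean(values):
    """Arithmetic mean: sum of the observations divided by n."""
    return sum(values) / len(values)


def sum_sq_dev(values, c):
    """Sum of squared deviations of the observations from a constant c."""
    return sum((y - c) ** 2 for y in values)


y_bar = mean(weeks)

# Zero-sum property: deviations from the mean cancel (up to rounding error).
zero_sum = sum(y - y_bar for y in weeks)

# Least-squares property: any c other than the mean gives a larger sum.
ss_at_mean = sum_sq_dev(weeks, y_bar)
ss_elsewhere = [sum_sq_dev(weeks, y_bar + shift) for shift in (-5, -1, 1, 5)]

print(f"mean = {y_bar:.2f}, sum of deviations = {zero_sum:.10f}")
print(f"SS at mean = {ss_at_mean:.2f}, SS elsewhere = {ss_elsewhere}")
```

Because the arithmetic is floating point, the sum of deviations may print as a very small number rather than exactly zero.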
The median is the middle value of a variable when the observations are ordered from smallest to largest.
It divides the distribution into two equal halves: half of the observations lie at or below the median and half lie at or above it.
Key advantage: The median is resistant to outliers and skewed data.
Imagine there are 5 people sitting in a bar:
Imagine that Elon Musk walks into the bar:
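The bar example can be sketched in a few lines. All figures here are assumptions chosen for illustration: the five incomes and the newcomer's income are invented, not real data.

```python
import statistics

# Hypothetical annual incomes of the five people in the bar.
bar_before = [30_000, 40_000, 50_000, 60_000, 70_000]

# Hypothetical income for the newcomer; the exact figure does not matter,
# only that it is far larger than everyone else's.
newcomer_income = 1_000_000_000
bar_after = bar_before + [newcomer_income]

mean_before = statistics.mean(bar_before)
median_before = statistics.median(bar_before)
mean_after = statistics.mean(bar_after)
median_after = statistics.median(bar_after)

print(f"before: mean = {mean_before:,.0f}, median = {median_before:,.0f}")
print(f"after:  mean = {mean_after:,.0f}, median = {median_after:,.0f}")
```

With these made-up numbers the median moves only from 50,000 to 55,000, while the mean jumps to roughly 166.7 million: one extreme observation drags the mean but barely shifts the median.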
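Variance is the average squared distance between an observation and the mean; the standard deviation is its square root, which puts the measure back in the original units of \(Y\):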
\[ \mathrm{var}(Y) = s^2 = \frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n} \]
\[ \mathrm{sd}(Y) = s = \sqrt{\frac{\sum_{i=1}^n (Y_i - \bar{Y})^2}{n}} \]
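A minimal sketch of these two formulas, using made-up values and hypothetical helper names (`variance_n`, `sd_n`). It divides by \(n\), as the formulas above do; note that many software defaults, including Python's `statistics.variance`, divide by \(n - 1\) instead, so results can differ slightly.

```python
import math
import statistics

# Made-up values, for illustration only.
y = [2, 4, 4, 4, 5, 5, 7, 9]


def variance_n(values):
    """Average squared deviation from the mean (divides by n)."""
    y_bar = sum(values) / len(values)
    return sum((v - y_bar) ** 2 for v in values) / len(values)


def sd_n(values):
    """Standard deviation: square root of the n-denominator variance."""
    return math.sqrt(variance_n(values))


var_y = variance_n(y)
sd_y = sd_n(y)

# Cross-check against the standard library's n-denominator versions.
pvar = statistics.pvariance(y)
psd = statistics.pstdev(y)

# For comparison: the (n - 1)-denominator version is slightly larger.
sample_var = statistics.variance(y)

print(f"var = {var_y}, sd = {sd_y}, n-1 variance = {sample_var:.3f}")
```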
Assuming a normal distribution:
\[
\mathrm{cov}(X, Y) = \frac{\sum_{i=1}^n (X_i - \bar{X})(Y_i - \bar{Y})}{n}
\]
- If \(X\) and \(Y\) tend to increase together, most terms are positive → positive covariance.
- If one tends to increase as the other decreases, most terms are negative → negative covariance.
- If there’s no clear relationship, positives and negatives cancel → covariance ≈ 0.
cov(Age, Weeks of Abortion) = -18.4
\[ \mathrm{cor}(X,Y) = \frac{\mathrm{cov}(X,Y)}{\mathrm{sd}(X)\,\mathrm{sd}(Y)} \]
cor(Age, Weeks of Abortion) = -0.086
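The covariance and correlation formulas can be sketched as follows. The `age` and `weeks` values are invented for illustration and will not reproduce the survey figures above; the function names are hypothetical. The last lines show one reason to prefer correlation: rescaling a variable (years to months) changes the covariance but not the correlation.

```python
import math

# Made-up data, for illustration only (not the survey responses).
age = [20, 30, 40, 50, 60]
weeks = [24, 20, 22, 12, 12]


def mean(values):
    return sum(values) / len(values)


def cov(x, y):
    """Average product of deviations from the means (divides by n)."""
    x_bar, y_bar = mean(x), mean(y)
    return sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y)) / len(x)


def sd(x):
    """Standard deviation with an n denominator; note cov(x, x) is the variance."""
    return math.sqrt(cov(x, x))


def cor(x, y):
    """Covariance rescaled by both standard deviations; lies in [-1, 1]."""
    return cov(x, y) / (sd(x) * sd(y))


cov_years = cov(age, weeks)
cor_years = cor(age, weeks)

# Measure age in months instead of years.
age_months = [a * 12 for a in age]
cov_months = cov(age_months, weeks)
cor_months = cor(age_months, weeks)

print(f"years:  cov = {cov_years:.1f}, cor = {cor_years:.3f}")
print(f"months: cov = {cov_months:.1f}, cor = {cor_months:.3f}")
```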
Histograms allow us to visualize the distribution of a single continuous variable.
Box plots help us compare the distributions of a continuous variable across categories.
We use scatter plots to visualize the relationship between two continuous variables: